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Abstract 

Biochemical activity and core stability are essential properties of proteins, maintained usually by conserved amino acids. 
Structural dynamics emerged in recent years as another essential aspect of protein functionality. Structural dynamics 
enable the adaptation of the protein to binding substrates and to undergo allosteric transitions, while maintaining the 
native fold. Key residues that mediate structural dynamics would thus be expected to be conserved or exhibit 
coevolutionary patterns at least. Yet, the correlation between sequence evolution and structural dynamics is yet to be 
established. With recent advances in efficient characterization of structural dynamics, we are now in a position to perform 
a systematic analysis. In the present study, a set of 34 enzymes representing various folds and functional classes is analyzed 
using information theory and elastic network models. Our analysis shows that the structural regions distinguished by their 
coevolution propensity as well as high mobility are predisposed to serve as substrate recognition sites, whereas residues 
acting as global hinges during collective dynamics are often supported by conserved residues. We propose a mobility scale 
for different types of amino acids, which tends to vary inversely with amino acid conservation. Our findings suggest the 
balance between physical adaptability (enabled by structure-encoded motions) and chemical specificity (conferred by 
correlated amino acid substitutions) underlies the selection of a relatively small set of versatile folds by proteins. 
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Introduction 

The role of structural dynamics in enabling protein func- 
tion has been underlined in recent work (Bhabha et al. 
2011). In some cases, dynamics is manifested by large-scale 
collective motions of intact substructures. Examples are the 
opening/closing of domains around a catalytic cleft or the 
allosteric switches that cooperatively engage multiple sub- 
units in multimeric structures. Many enzymes and molec- 
ular machines such as the bacterial chaperonin or the 
ribosome, or multimeric membrane proteins involved in 
allosteric signaling or transport, undergo such concerted 
motions triggered by substrate binding (Tama and Brooks 
2006; Yang et al. 2009; Bahar et al. 2010). These are usually 
referred to as "global motions" due to their collective 
nature. In other cases, the motions are "local," for example, 
rearrangements of recognition loops or rotational isomer- 
izations of side chains. 

Global motions are predominantly encoded by the ar- 
chitecture of the protein. Models based exclusively on na- 
tive contact topology, such as elastic network models, have 
proven to closely reproduce the structural variabilities ob- 
served in experiments for proteins resolved in multiple 
substrate-bound forms (Bakan and Bahar 2009; Bahar 
et al. 2010). The fact that these motions are uniquely 
and robustly defined by the architecture suggests that na- 
tive folds may have evolved to favor functional motions. 
This also suggests that there are key mechanical sites that 
control the global movements while preserving the stability 
of the fold. To date, no systematic study of the evolutionary 
conservation properties of amino acids in relation to the 



structure-encoded dynamics of proteins has been per- 
formed to our knowledge. 

Local motions, on the other hand, may facilitate the rec- 
ognition of substrates, optimize binding interactions, and 
allow for gate opening/closing, usually complementing 
global motions (e.g., domain closure) or accompanying 
structure formation upon substrate binding (Wright and 
Dyson 2009). Substrate recognition sites tend to exhibit 
suitable sequence variations so as to enable specific recog- 
nition (Liu et al. 2010); and at the same time, they may 
enjoy structural flexibility, consistent with conformational 
adaptability required for mediating substrate specificity 
(James et al. 2003). In contrast, conserved residues are 
highly ordered, as evidenced by nuclear magnetic reso- 
nance relaxation experiments (Mittermaier et al. 2003). 
Our examination of the collective dynamics of catalytic 
sites (Yang and Bahar 2005) and metal-binding sites (Dutta 
and Bahar 2010) also showed that residues involved in bio- 
chemical activities exhibit minimal fluctuations. 

All these observations suggest that sequence variability 
and structural dynamics go hand in hand; and recent stud- 
ies highlight the importance of combining information on 
evolutionary conservation and structural dynamics in order 
to recognize proteins that share functional properties 
(Tang and Altman 201 1 ). The prevalence of such a relation- 
ship between sequence evolution and structural dynamics 
remains to be analytically investigated and established. 

In the present study, we present the results from the 
analysis of 34 enzymes that represent a diversity of protein 
families, functional classes, and sizes (supplementary table 
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S1, Supplementary Material online). For each enzyme, we 
determined the relative mobility each residue enjoys in the 
collective dynamics, on the one hand, and the amino acid 
conservation or correlated mutation propensities at the 
corresponding sequence position, on the other. Our anal- 
ysis shows that 1) conserved residues have minimal fluctu- 
ations in the global modes, their high stability presumably 
being a prerequisite for their precise functioning, 2) in- 
crease in sequence variability is accompanied with increase 
in conformational mobility, this feature being most distinc- 
tive at intermediate levels of conservation/mobility typical 
of coevolving pairs of amino acids, 3) the coevolving res- 
idues, identified after removing effects originating from 
common ancestry, fall into two groups: those involved 
in substrate recognition and others in the neighborhood 
of substrate-binding sites, presumably assisting in substrate 
stabilization or signal transmission (the former group is dis- 
tinguished by its enhanced mobility in the global modes of 
the enzyme), and 4) it is possible to define an intrinsic mo- 
bility scale for the 20 types of amino acids, which might be 
utilized for customizing protein dynamics. 

Materials and Methods 

Enzyme Data Set 

The data set used in a previous study (Zen et al. 2008) was 
adopted as starting point. This data set contained 76 en- 
zymes with a broad range of functions. Among them, we 
focused on the monomeric X-ray structures that contained 
at least 120 structurally resolved residues. For each enzyme, 
the multiple sequence alignment (MSA) retrieved from the 
Pfam database (Finn et al. 2008) was refined using the fol- 
lowing procedure: 1) iteratively align the primary (query) 
sequence from the Protein Data Bank (PDB) with each se- 
quence in the MSA using the Smith-Waterman algorithm 
(Smith and Waterman 1981) and identify a "matched" se- 
quence with the highest score, which shares at least 95% 
sequence identity the PDB sequence; 2) based on the res- 
idue mapping between the PDB sequence and the matched 
sequence, truncate the columns of the MSA so as to retain 
those residues structurally resolved in the PDB sequence; 
and 3) remove the redundant sequences in the refined 
MSA using a threshold of 99% and eliminate the sequences 
that have more than 20% gaps. Supplementary table S1 
(Supplementary Material online) lists the specifications 
of the proteins in the resulting data set. 

Structural Dynamics 

Gaussian network model (GNM) analysis was performed 
according to the well-established protocol described in 
our previous work (Bahar et al. 1997; Yang et al. 2006), with 
a cutoff distance of 7.3A. For an enzyme of N residues, each 
GNM mode k is represented by an N-dimensional eigen- 
vector, and eigenvalue X kl describing the mode shape 
and frequency (squared), respectively. The /th element 
[u (/c) ], of u (/c) describes the displacement of residue / along 
the kth mode axis. By definition, eigenvectors are normal- 
ized, that is, the plot of [n^]f as a function of residue index 



/ also represents the normalized distribution of square dis- 
placements in mode k. The reciprocal 5^T 1 serves as the 
weight of mode k, such that the slow modes, also called 
soft modes, make the largest contributions to observed 
dynamics. The fractional contribution (or probability) of 
m modes to the overall dynamics is given by w(m) = 
ET=i VVEk=i V 1 - The corresponding weighted- 
average mobility of residue / is defined as <A/l/>| m = 

The "mobility profile" driven by m modes for a given 
protein is obtained by plotting <M,>\ m as a function of 
residue index /. Minima in the mobility profile refer to sites 
that exhibit minimal "translational" movements in the col- 
lective motions. Note that a region may be structurally con- 
strained (e.g., in the core of a domain) but exhibit high 
mobility (being embedded in a 'moving' domain), or vice 
versa, that is, a region that enjoys some local (e.g., rota- 
tional) flexibility may exhibit minimal mobility if it main- 
tains its spatial position during the collective dynamics of 
the protein (thus acting as a hinge/anchor). 

Conservation and Coevolution Patterns 
The tolerance of sequence position / to mutations or 
amino acid substitutions is measured by the Shannon 
information entropy (Cover and Thomas 1991) 
S(/)= — Ea°=i P( a /)l°gP( a /)/ where P(aj) is the probability 
(or fraction) of occurrence of amino acid type a at the /th 
column of the MSA. S(i) varies in the range 0 < S(/) < 
ln(20) = 3.0, the lower and upper limits corresponding 
to fully conserved and fully random (equal probability 
of all twenty amino acid types) amino acids at the /th po- 
sition. Gaps in each column are treated as uniformly distrib- 
uted amino acids. 

Using a similar notation, the coevolution propensity of 
the amino acids at the /th and jth positions along the se- 
quence is given by the mutual information (Ml), 

'(' J)= E«,=i E^ =1 K a h b j) [o ^^^-y where p ( a » b i) des ' 
ignates the joint probability of observing amino acid types 
a and b and the respective sequence positions / and j (Liu 
et al. 2008; Dunn et al. 2008). Here gaps are treated as the 
21st amino acid type. The Ml profile for each sequence po- 
sition /, <!(/)>, is calculated by taking the average 
</(/)>= /(/,j)/N, where the summation is per- 
formed over all j ^ i. 

The inclusion of related sequences in the MSA may lead 
to an overestimation of conservation and coevolution pro- 
pensities by including those amino acids that retain their 
identity due to common ancestry rather than selective 
functional requirements. This (phylogeny) effect is partic- 
ularly important in the evaluation of Ml values. For exam- 
ple, our analysis of HIV-1 protease sequences clearly 
showed two classes of correlated pairs, attributed to phy- 
logeny and multidrug resistance, respectively (Liu, Eyal, and 
Bahar 2008), and it is important to filter out ancestral ef- 
fects so as to assess functional correlations. Several meth- 
ods have been proposed to reduce the effect of shared 
ancestry in evaluating coevolutionary patterns (Atchley 
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conservation/co-evolution 




mobility 




Fig. 1. Workflow of the study. For each query enzyme in the data set, we retrieve the structure from the PDB and the MSA from Pfam database. 
These are used as input for 1) GNM evaluation of residue mobilities (right branch) and 2) generation of conservation profile and coevolution 
maps (left branch), respectively. Comparison of the outputs shows that sequence entropy is accompanied by conformational mobility 
(enhanced dynamics), correlated mutations exhibit a broad range of mobilities depending on the type of underlying evolutionary pressure, and 
conserved sites are practically immobile. Statistically significant results are obtained by compiling the outputs for 34 enzymes. 



et al. 2000; Tillier and Lui 2003; Dunn et al. 2008). Here we 
adopt the average product correction (APC) proposed by 
Gloor and coworkers, which proved to dramatically im- 
prove residue contact predictions based on coevolution 
statistics. Accordingly, the corrected Ml, designated as 
Mlp, is evaluated using the expression Mlp(/, j) = l(/, j) 
- APC(/, j), where the correction term is given by APC(/, 
j) = [<|(/)> <\Q)>]/<\(i, j)>. Here, <\(i, j)> is the av- 
erage over all Ml values evaluated for a given MSA. Subtrac- 
tion of the APC(7,j) lowers Ml values in general, especially at 
residue pairs whose average Ml values are generally high, 
allowing to distinguish the pairwise covariations devoid 
of the inherent variation/conservation properties of the in- 
dividual residues. Although the absolute strength of the sig- 
nals are weakened in general, the evaluation of Mlp(/, j) 
allows for better discrimination/visualization of the sites 
that undergo correlated mutations upon removal of phy- 
logenetic effects specific to the examined protein family. 

The above metrics have been normalized to introduce 
a uniform numerical scale for the profiles among different 
proteins and detect recurrent patterns. To this aim, the 
normalized distributions were uniformly multiplied by the 



number of residues so as to eliminate the dependence of 
the resulting mobility/conservation/coevolution profiles on 
the size of the proteins. Effective properties were assessed us- 
ing a grid-based mapping scheme. The basic idea therein is 
to cluster residues with similar entropy (using bin sizes of 
AS = 0.1) and evaluate the average mobility within each 
cluster. 

Mobility, Conservation, and Coevolution 
Propensities as a Function of Amino Acid Types 
We extracted three subsets of amino acids distinguished by 
their high conservation, high coevolution, and high mobil- 
ity properties, shortly designated as C-, £-, and A/l-sites, re- 
spectively. The propensity of a given amino acid (of type a) 
to belong to the subset of X-sites (X = C, E, or A/I) is given by 
the ratio, Px(a) = [N a , x /N x ]/N a /N total ]. Here N a and N a>x 
denote the numbers of amino acids of type a in the com- 
plete dataset of 34 enzymes and in the subset X respec- 
tively, N total =£f=iN a and N x = £f =1 N a #. More 
details on the evaluation of amino acid propensities is 
presented in the supplementary text S1 (Supplementary 
Material online), and protocols for evaluating mobility 
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Fig. 2. An illustrative example: comparative analysis of residue conservation, conformational mobility, and coevolutionary patterns for UDG. 
(a) Mobility and conservation profiles as a function of residue index. Blue, red, and black curves represent the mobility profiles <M/>| ml , 
<Mj>\ m2 , and <M,>\ N _ ^ (or MSFs) computed using the GNM. The curves are shifted vertically for clarity. The bars represent the 
information entropy derived from 1599 Pfam sequences (supplementary table SI, Supplementary Material online). Results are shown for the 
structurally resolved residues 131 < / < 292 that are fully represented in the MSA. (b) Comparison of conservation (upper) and mobility 
(lower) profiles using color-coded ribbon diagrams, (c) Mlp map for the UDG family (see supplementary fig. S2, Supplementary Material online, 
for the corresponding Ml map). The magnified portion refers to the DNA-binding region of UDG. Highest signals are detected at M1 31 -1 134, 
P163, V164, I181-F184, C281, and H283. (d) Location of residues distinguished by high Mlp values at DNA-binding site. The diagram is color 
coded based on the crystallographic B factors (red/blue: most/least mobile) reported for UDG. 



profiles, sequence entropy profiles, and Ml maps may be 
found in our previous work (Liu et al. 2008, 2010). 

Results and Discussion 

Overview 

Figure 1 illustrates the method of approach. We adopted 
a two-pronged analysis for each enzyme: 1 ) perform a GNM 
(Bahar et al. 1997) analysis of collective dynamics using the 
PDB structures and 2) analyze the residue conservation and 
coevolution properties using the approach described in 
Materials and Methods. 

The GNM analysis yields a "mobility profile" for each 
enzyme. N — 1 GNM modes of motion contribute to 
the structural dynamics of an enzyme of N residues. The 
mobility profile based on all modes, <M,>\ N _ V scales with 
the mean square fluctuations (MSFs) of residues. The low 
frequency modes, also called the "soft" modes or global 
modes, play a dominant role in defining the most cooper- 
ative events. We examined the contribution made by these 
modes to MSFs. To this aim, we considered the profiles 
<Mj>\ m i and <Mj>\ m2 associated with ml and ml 



modes at the low-frequency end of the spectrum, which 
make fractional contributions of 0.1 and 0.4, respectively, 
to collective dynamics (see supplementary table S2, Supple- 
mentary Material online). 

The MSAs are utilized to generate the "conservation 
profile" (S(/) as a function of residue index /) and "coevo- 
lution maps" (both and Mlp(/, j) as a function of / and 
j) for each protein. Comparison of mobility profiles and 
conservation/coevolution trends for each enzyme, consol- 
idated over the entire dataset, discloses three different clas- 
ses of residues based on their mobility/evolution behavior. 
Conserved residues distinguished by S(i) values below 
a threshold undergo minimal changes in their positions 
in the 3D structure. Conversely, the sites that exhibit un- 
corrected variations in their amino acid identity display 
enhanced mobilities, although the extent of mobility 
broadly varies. The intermediate regime exhibits a linear 
increase in mobility with increasing sequence entropy. 
Most coevolving residues fall in this regime. The results 
highlight the importance of structural adaptability in sus- 
taining the functional dynamics of the enzyme notwith- 
standing sequence variations that confer specificity. 
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An Illustrative Example 

For clarity, some of the basic steps and outcomes are illus- 
trated for a DNA repair enzyme, uracil-DNA glycosylase 
(UDG), in figure 2. Panel a displays the mobility profiles 
based on ml, m2, and N — 1 GNAA modes. In UDG, 
ml = 1, that is, the softest mode alone accounts for 
>10% of the dynamics (see highlighted entry in supplemen- 
tary table S2, Supplementary Material online). </Vl,>| ml 
shows the distribution of square displacements of residues 
in this softest mode. <A/1, > | N - 1 scales with the MSF profile 
of residues and contains contributions from both global and 
local motions; yet the shape of the curve is dominated by 
slow/soft modes as the close resemblance to <A/l/>| 
veals. The gray bars in figure 2a represent the Shannon en- 
tropy profile. Peaks represent the most variable sites, and 
minima, the most conserved. Notably, mobility and entropy 
distributions exhibit similarities, as also evidenced by the 
color-coded ribbon diagrams displayed in panel b. The rela- 
tion between sequence variations and structural dynamics at 
the level of individual residues is clearly seen by evaluating 
the "effective mobilities" based on entropy bins of AS = 0.1 
and compiling the results for all enzymes in our data set. The 
resulting <A/lf f >| N _ 1 values yielded a correlation of 0.82 
with sequence entropy, whereas the plot for individual res- 
idues gave a correlation of 0.52 (supplementary fig. S1, Sup- 
plementary Material online). This observation underscores 
the significance of consolidating the outputs with an ensem- 
ble of proteins, rather than examining single proteins where 
the patterns may be barely detectable. 

The Mlp map in figure 2c displays the coevolutionary 
properties of UDG residue pairs. Red regions indicate 
the pairs that exhibit the highest Mlp values, that is, the 
loci of most correlated mutations. The upper right portion 
of the map magnified in panel c reveals the high coevolu- 
tionary properties of residues near DNA-binding site, 
shown in panel d. Supplementary figure S2 (Supplementary 
Material online) shows that the corresponding Ml map ex- 
hibits similar features. Comparison of Ml and Mlp maps 
shows that in general signals are weakened in the Mlp 
map (due to complete removal of contributions potentially 
arising from common ancestry). However, those at the 
DNA-binding regions highlighted in panel d are maintained 
and become even more distinctive in the background of 
weakened correlations. 

Our previous examination of the sequence evolution 
properties of Hsp70 ATPase domain in relation to its intrin- 
sic dynamics suggested that among coevolving residues 
those distinguished by high mobility in the global modes 
serve as substrate recognition sites (Liu et al. 2010). Many 
coevolving residues in Hsp70 have been reported to be in- 
volved in allosteric responses (Smock et al. 2010). The role of 
structurally labile, sequentially correlated residues in sub- 
strate recognition was also pointed out for PDZ domains 
by Kosik and coworkers (Sakarya et al. 2010). E182, D183, 
R276-F279, and C281-H283 appear to be such residues in 
UDG (fig. 2a). Notably, as evidenced by the structure shown 
in figure 2d, the residues R276-G282 do interact with DNA, 



and I181-F184 distinguished by their high Mlp (and Ml) sig- 
nals are also suggested here to be interacting with DNA due 
to their spatial location adjacent to the former group and 
also their high coevolutionary propensity. 

Sequence Entropy versus Conformational Mobility 
for All Enzymes 

We repeated the comparative analysis summarized for UDG 
for all 34 enzymes in our data set. The results, compiled in 
supplementary table S3 (Supplementary Material online), 
confirm that mobilities/restrictions and sequence variabil- 
ities/conservations exhibit weak but statistically significant 
correlations; and these correlations become apparent when 
the effective entropies for the complete set of enzymes are 
consolidated based on entropy bins of AS = 0.1 (as opposed 
to plotting individual values). Figure 3a shows the results for 
all the 8,254 residues in our data set. The curves show the 
best fit to effective mobility <A/lf ff >| m1 and MSFs (or 
<A/1f ff >| N _ 1 ). The number distribution of residues in each 
entropy interval is shown by the histogram (gray bars). 

Several interesting features are observed in figure 3a. 
First, the coupling between structural dynamics and se- 
quence variability is more pronounced when the global 
motions driven by a few soft modes (ml = 1-2; supple- 
mentary table S2, Supplementary Material online) are ex- 
amined, as opposed to the resultant of all N — 1 modes. 

Second, this dependence is not linear. Higher sequence 
entropy (or lower conservation) is accompanied by in- 
creased mobility as expected, but this increase does not 
take effect until the entropy reaches a threshold value 
of S, ~ 0.8 (orange arrow). In the range Sj < 0.8, the global 
mobility is minimal with little dependency on the conser- 
vation level. About 30% of residues lie in this low-mobility/ 
high conservation regime. Then, there is a sharp increase in 
mobility tied in with decrease in entropy. Sequence vari- 
ability above this threshold value cannot presumably be 
sustained unless the global dynamics endows suitable 
structural flexibility. In the other extreme case of high en- 
tropy regime (S, > 1.5, delimited by green arrow), residues 
exhibit a broad variation in their mobility, partly due to the 
scarcity of data (9% of residues lie in this regime). Therefore, 
we distinguish three regimes, with the strictest dependence 
on mobility manifested at the intermediate level 0.8 < S, < 
1.5 of sequence entropy. 

Third, the histogram for entropy (gray bars in fig. 3a) 
exhibits a unique behavior with a peak at the most con- 
served region (leftmost bar), thus departing from a unim- 
odal distribution. This peak refers to fully conserved 
residues. The size of this group (322 residues) is much larger 
than that expected for a normal distribution tail. GNM cal- 
culations confirm that this subgroup of residues exhibit 
minimal fluctuations (supplementary fig. S3, Supplemen- 
tary Material online). In contrast, the most variable group 
(the rightmost bar in the histogram) contains 117 residues 
that span a wide range (1.9 < S, < 2.9) of entropy and 
effective mobility, preferentially sampling larger fluctua- 
tions in space (supplementary fig. S3, Supplementary 
Material online). 
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Fig. 3. Relationship between structural dynamics and sequence evolution properties, (a) Effective mobility as a function of sequence 
conservation, based on softest modes (red circles) or N — 1 modes (open circles) computed for all residues in the data set of 34 enzymes. The 
curves are the weighted least square fits to computed data, with respective correlation coefficients of 0.90 and 0.95. The number distribution of 
residues in different entropy intervals is shown by the gray bars (right ordinate). Entries with S, > 2 are merged in the last bin. Arrows delimit 
distinctive mobility versus conservation regimes, (b) Sequence entropy distribution for all residues (orange) and a subset distinguished by their 
high coevolution propensities (cyan), (c) Mobility histograms for three groups of residues, as labeled. Respective mean values and variances are 
1.00 ± 0.134, 0.79 ± 0.059, and 1.06 ± 0.127. 



Residues Distinguished by Coevolutionary 
Properties Usually Exhibit Intermediate Levels of 
Mobility 

We evaluated the Ml and AAlp maps for all the enzymes in 
our data set and identified the residues that yielded the 
strongest coevolution signals. Figure 3b and c show the re- 
spective conservation and mobility distributions (cyan 
bars) evaluated for the residues that yielded the top 
20% <!(/)> values (1,639 of them), referred to as highly 
coevolving residues. Panel b compares their sequence en- 
tropy distribution to that of the entire residue set (orange). 
Notably, a large majority (82%) of highly coevolving resi- 
dues fall in the intermediate entropy regime identified 
above. And the distributions in figure 3c show that these 
residues tend to enjoy larger mobilities compared with 'all' 
residues. Calculations repeated with AAlp values confirmed 
the same trend, with a slight shift of the overall distribution 
toward higher entropy (and higher mobility) regime, con- 
sistent with the elimination of a number of residue pairs 
that appear to covary due to their common ancestry 
(fig. S4). Panel c also displays the histogram (green) for 
the most conserved sites, referred to as C-sites (lowest 
20% S(/) values), again showing their lower mobility com- 
pared with all residues. 

Substrate Recognition Is Assisted by Coevolving 
Residue Pairs that Enjoy Enhanced Global Mobility 
Coevolution of amino acids appears to enable the adapt- 
ability of ubiquitous proteins or their modular domains to 
cope with diverse substrates (Gotoh 1992; Liu et al. 2008, 



2010; Xu et al. 2009; Smock et al. 2010). Our earlier study 
invited attention to the enhanced global mobility of such 
sites involved in substrate recognition (Liu et al. 2010). Ob- 
servations made here further support this notion. 

Figure 4 illustrates the results for procathepsin B 
(Podobnik et al. 1997). Results for other proteins (staphy- 
lococcal nuclease, T7 lysozyme, carbonic anhydrase II, and 
carboxypeptidase A) may be seen in the supplementary 
figures S5 and S6 and table S4 (Supplementary Material on- 
line). In all cases, the "highest" peaks in the global mode 
include residues distinguished by their high coevolution 
propensities (indicated by squares on the global mobility 
curves), and these residues are noted to assist in substrate 
recognition. Figure 4 shows that in procathepsin the resi- 
dues distinguished by their strong Mlp values are mainly 
clustered in the occluding loop N113-T125 that is involved 
in substrate recognition (Illy et al. 1997) and inhibitor bind- 
ing (Renko et al. 2010). The high coevolution propensity of 
this loop is apparent even by examining the Ml map 
(supplementary fig. S7, Supplementary Material online). 
Figure 4b shows the pronounced mobility of this loop in 
the global mode of motion of the enzyme. 

It is worth noting that apart from the residues involved 
in substrate binding, those located near active sites (e.g., 
catalytic or signal transduction sites) may also exhibit co- 
evolutionary trends, if they are not conserved. Binding and 
signaling are achieved more efficiently in the case of tight 
packing and minimal energy dissipation or residue fluctu- 
ations in the global modes. The inhibitor-bound structure 
of cathepsin B (Renko et al. 2010) presents such sites (S65, 
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Fig. 4. Sequence coevolution and high mobility properties at the ligand recognition site of procathepsin B catalytic domain, (a) Ml map, 
highlighting (in red) the coevolving amino acid pairs. Residues corresponding to the top 0.05% Mlp values (N47, A48, S65, M66, 1105, C108, 
N113-P118, T120-G123, T125, A248, and G249) are indicated by squares on the <A/l- ff >| ml curve, color. They are shown by spheres in the 
ribbon diagram for the complex formed with stefin A (cyan), (b) Global mobility profile (orange) and MSF distribution of residues (cyan) for 
procathepsin B. The residues distinguished in panel a by their coevolutionary propensities are shown by red spheres in the ribbon diagram of 
the protein (gray/red). Note the close neighborhood of this region to the binding site of the substrate stefin A (cyan). 



C67, and G68), in close spatial proximity of substrate- 
binding site; the restricted mobility of these residues in 
the global mode suggests a signal transduction role. 

Mobility Scale of Amino Acids and Contrasting 
Mobility and Conservation Propensities 
We developed an automated procedure for identifying the 
sites distinguished by their high mobility in the global 
modes, shortly referred to as A/l-sites (see Materials and 
Methods, and supplementary text S1, Supplementary Ma- 
terial online), and evaluated the propensity P M of different 
types of amino acids to take part in these sites. Figure 5 
displays the resulting distribution (orange bars), obtained 
after normalizing the results with respect to the frequency 
of occurrence of different types of residues in our data set. 
Note that P M = if the probabilistic participation in A/1-sites 
is not different from that expected from a priori frequency 
(natural occurrence) of amino acids; P M > 1 refers to 
amino acids that undergo relatively large displacements 
(based on backbone fluctuations); and P M < 1, to those 
restricted. Calculations repeated with ml and N — 1 modes 
yielded similar propensities, as shown by the respective 
dark orange and red bars. 

Figure 5a also displays the distribution of amino acids 
among the most conserved (C-) sites (green). Higher bars 
indicate higher conservation propensity. Amino acids are 
ordered along the abscissa according to their conservation 
propensity. Cysteines are most conserved, followed by His 
and Trp, and Lys, least conserved. The high level of conser- 
vation of histidines may be attributed to their unique mul- 
tidirectional proton transfer capability, which also makes 
them the most common amino acid at active sites (Betts 
and Russell 2007). Their lowest P M value compared with 
charged amino acids is probably due to aromatic stacking 
interactions that restrain their flexibility, like other aro- 
matic residues (Trp, Phe, and Tyr). In contrast, Lys and 
Glu are distinguished by high mobilities (in both global 
and local motions), whereas Cys is one of the least mobile 



residues, along with Val, lie, and Leu. The latter group usu- 
ally lies in the hydrophobic core. The mobility ranking of 
amino acids is reminiscent of hydrophobicity scales, con- 
sistent with the tendency of hydrophobic residues to be 
buried in the core and thereby have limited motions. 

The most striking observation in figure E>a is the con- 
verse mobility and conservation propensities of amino 
acids: an amino acid type with high conservation propen- 
sity P c generally has low propensity P M for large move- 
ments and vice versa. These opposite propensities are 
most pronounced at the two ends of the spectrum. 
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Fig. 5. Mobility, conservation, and coevolution propensities of 
amino acids, (a) Distributions of amino acids within the subsets 
composed of highly conserved (C-) (green bars) and highly mobile 
(A/1-) sites (light-to-dark orange bars, based on ml, ml, or N — 1 
modes, as labeled). The bars represent the propensities with respect 
to those expected a priori based on the frequency of occurrence of 
the particular amino acid types in the data set. (b) Coevolution 
propensities of amino acids based on Ml (light blue) and Mlp (dark 
blue) values, as labeled. Amino acid types (shown by one-letter 
codes) are listed in the order of decreasing entropy in both panels. 
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Coevolution Propensity of Amino Acids 
The coevolution propensities, P E , of amino acids are pre- 
sented in figure 5b. The propensities were based on the 
strongest signals observed in the Ml (light blue) and 
Mlp maps (dark blue). The two scales exhibit similar prop- 
erties, suggesting that the composition of the subset of res- 
idues that yields the strongest coevolution signals (E-sites; 
see supplementary text S1, Supplementary Material online) 
is relatively insensitive to the choice of the metric, Ml or 
Mlp. For ease of comparison, the amino acids along the 
abscissa are listed in the same order as panel a. 

Comparison of the histograms in figure 5b with those in 
panel a shows that the coevolutionary propensities of 
amino acids are practically independent of their conserva- 
tion or mobility scales. Proline exhibits the highest coevo- 
lution propensity (based on Mlp values). Trp is also 
distinguished by its high tendency to take part in correlated 
mutations. This is presumably due to its large size and its 
ability, along with other aromatic residues such as tyrosine, 
to make specific interactions (e.g., aromatic-guanidinium 
interactions with Arg) at protein-protein interfaces 
(Crowley and Golovin 2005). Other residues distinguished 
by their high coevolutionary tendencies are Cys and Tyr, 
which, similarly to Trp, are usually conserved (see panel 
a) and presumably involved in specific interactions unable 
to sustain substitutions unless compensated by a correlated 
mutation. 

Polar residues, on the other hand, represent a unique 
group because of their relatively high coevolvability and 
high mobility. Ser, Asn, and Arg (a charged but versatile 
residue that has both hydrophobic and polar moieties) 
lie in this group. Their combined coevolution propensity 
and conformational mobility suggests that these particular 
amino acids are suitably recruited by proteins at substrate 
recognition sites being at the same time specific and flex- 
ible enough to mediate substrate selectivity. 

Determinants of Sequence Conservation 
The present study shows that amino acids constrained in 
the collective motions (especially in the global modes) of 
enzymes tend to be conserved. However, sequence conser- 
vation may be attributed to various sources. The observed 
correlation may not exclusively arise from dynamic require- 
ments but could be a manifestation of other functional or 
stability requirements, which in turn impose constraints on 
the dynamics, and on sequence identity. For example, cat- 
alytic residues are usually conserved, and so are metal- 
binding residues, due to their unique biochemical activities 
that are achieved only under well-defined coordination ge- 
ometries, hence the need to stabilize specific poses/orien- 
tations of side chains at substrate/ligand-binding site. Our 
earlier studies demonstrated that these regions exhibit 
minimal motions in the global modes, consistent with 
these requirements (Yang and Bahar 2005; Dutta and Bahar 
2010). 

In the same way, one might think that the most mobile 
regions are usually solvent exposed, and those buried, 



forming the hydrophobic core, tend to be highly stable/im- 
mobile (and conserved). GNM mobilities are by definition 
(via the Kirchhoff connectivity matrix that describes the 
native contact topology) dependent on local packing 
density (Bahar et al. 1997; Haliloglu et al. 1997). Comparison 
of the relative area of solvent accessibility (RASA) 
(Fraczkiewicz and Braun 1998) and GNM mobilities for 
all amino acids in our data set yielded, for example, a cor- 
relation coefficient of 0.54. Not surprisingly RASA values 
also yielded a correlation of 0.43 (see supplementary table 
S5, Supplementary Material online) with conservation 
levels (Shannon entropies). 

It may therefore be hard, if not impossible, to ascribe the 
observed conservation behavior to a "single" determinant, 
such as structural constraints (core, secondary structure or 
underlying local/specific interactions), mechanical/dy- 
namic role (e.g., a global hinge), specific functional role 
(e.g., catalysis, ligand coordination), especially when inter- 
actions, structure, and dynamics are themselves interde- 
pendent. However, the present analysis permits us to 
make a first assessment of the potential relevance of ob- 
served correlations to collective dynamics upon examina- 
tion mobility profiles. 

As an example, let us consider the procathepsin B cat- 
alytic domain. As shown in figure 4a, the occluding loop 
113-117 is distinguished by its remarkably high coevolu- 
tion propensity. This loop is highly exposed (with RASA 
values of 94.1%, 65.9%, 90.8%, 51.6%, and 46.8% for the re- 
spective residues). Its high mobility in the global dynamics 
of the enzyme is manifested by the peak in the <A/I>| m1 
profile (fig. 4b). On the other hand, other solvent-exposed 
residues (e.g., 134-137, shown by the green spheres in 
supplementary fig. S8, panel b, Supplementary Material 
online) have even higher RASA values (90%, 100%, 
25.8%, and 100%) but exhibit significantly lower mobility. 
Interestingly, the former group modulates substrate bind- 
ing and the latter does not. The dynamic character of the 
former group presumably confers optimal substrate-bind- 
ing ability. Not surprisingly, the same residues 1 1 3-1 1 7 are 
distinguished as coevolving residues, residues 134-137 are 
not. This example illustrates a case where global dynamics, 
rather than solvent exposure, correlates with evolutionary 
behavior. 

Toward a more systematic examination of the origin of 
observed correlations, we focused on the C-sites (subset of 
most conserved 1,700 residues, —20% of the complete set 
of 8,254 residues, with S(i) < 0.65; see fig. 3a), and we ex- 
amined their solvent exposure and global mobility. To this 
aim, we first extracted two subsets of residues of the same 
size: 1) dynamically restrained residues usually serving as 
hinge centers in the global modes, designated as H-sites, 
and 2) those exhibiting the lowest solvent exposure, that 
is, buried residues or B-sites. The members of these respec- 
tive subsets are found by rank ordering the normalized mo- 
bilities and normalized RASA values, starting from lowest 
values and simply selecting the top-ranking 20% in each list. 
Next we examined the overlap between the three subsets 
of conserved (C), dynamically restrained (H), and buried (B) 



2260 



Sequence Evolution Correlates with Structural Dynamics • doi:10.1093/molbev/mss097 



MBE 



(a) (b) | L230-R268 ] 




Fig. 6. Conserved sites distinguished by minimal fluctuations in global 
modes, despite moderate-to-high exposure to solvent. The figure 
illustrates four cases: (a) Staphyloccocal nuclease (PDB id: 1kab), (b) 
exonuclease III (PDB id: 1ako), (c) phospholipase C (PDB id: 2ffz), and 
(d) dehydrofolate reductase (PDB id: 3cd2). The labeled residues 
displayed in red space-filling representations simultaneously belong to 
the C- and H-subsets (of highly conserved and dynamically restrained 
residues) but not to the B-subset (of most buried residues). The 
identities of these residues and substructures whose collective 
dynamics they delimit are indicated by the labels (color coded after 
the substructures). The orange, space-filling residues in panel 
a illustrate a pair of residues that are highly conserved and buried 
(but globally moving as part of the violet substructure). 



residues. The C-subset contained 706 B-sites and 531 
H-sites (258 of which shared with the B-sites), and 721 
'other' residues. Because of the insensitivity of effective mo- 
bilities to sequence entropies in the regime S(/) < 0.8 
(fig. 3a), we did not expect a strong overlap between 
the C- and H-subsets. Yet, we still detect enrichments 
of 2.1 and 1.6, respectively, for the B- and H-sites in the 
C-subset, compared with random expectations. This statis- 
tical analysis thus shows that the majority of conserved res- 
idues are highly constrained, either structurally (B-sites) 
or dynamically (H-sites), and the two subsets of B- and 
H-residues are also interdependent. 

Of interest were the 273 H-residues whose evolutionary 
conservation cannot be ascribed to their extent of burial 
but to their potentially hinge-bending role in the collective 
dynamics encoded by the architecture. Figure 6 illustrates 
a few such cases in four enzymes. The highlighted residues 
therein serve as 'anchors' or hinge centers for modulating 
the concerted motion of substructures (indicated by differ- 
ent colors; see caption). We note that the highly conserved 
H-sites tend to be located at the loop regions or at the ter- 
mini of secondary structural elements, in contrast to buried 
residues that usually belong to secondary structural ele- 
ments that make tertiary contacts (e.g., G107 and A132, 
orange space-filling representation, in panel a). 



Conclusion 

Several recent studies have highlighted the significance of 
collective dynamics in achieving biological functions or en- 
abling biochemical activities. Yet, in previous studies, em- 
phasis has been usually on the evolutionary pressure 
originating from "structure stabilization" requirements. 
For example, a designable protein has been viewed as 
one that can sustain many substitutions while maintaining 
its structure (Li et al. 1996; Leelananda et al. 2011). In a re- 
cent excellent review, the need to retain functional inter- 
actions, in addition to conserving the architecture, has 
been pointed out (Worth et al. 2009). The present system- 
atic analysis, driven by the need to unravel the correlation, 
if any, between sequence conservation and intrinsic dy- 
namics shows that regions severely constrained in global 
modes also tend to retain/conserve their amino acid iden- 
tity; conversely, the most mobile regions are subject to the 
largest sequence variations. 

These observations raise questions with regard to the 
origin, or causality, of the observed correlations. It is not 
clear whether sequence variations (that are not necessarily 
functional) are allowed because of the intrinsic mobility of 
the structure at those particular regions, or whether, on the 
contrary, sequence variations (or coevolution) arise from 
adaptability requirement at certain regions (e.g., ubiquitous 
recognition sites), which, in turn, selectively stabilize the 
particular folds that confer suitable flexibility at those func- 
tional sites. A large number of sequence variations at highly 
flexible regions are presumably neutral. However, some are 
accompanied by compensating mutations; and those co- 
evolving pairs (or even clusters) of amino acids at regions 
distinguished by their uniquely high mobilities in the global 
modes are noted here to be involved in substrate recogni- 
tion, suggesting that their behavior is driven by functional 
requirements. Evidently, not all observed sequence corre- 
lations are the result of functional interactions. Some may 
simply originate from shared ancestry. To eliminate such 
effects in the detection of amino acid coevolutions, we 
adopted the Mlp method introduced by Dunn et al. 
(2008) and widely used in the literature (see, e.g., Buslje 
et al. [2009] or Dutheil [2012] for a recent extensive study). 

The predisposition of highly mobile and coevolving res- 
idues to serve as substrate recognition sites, also noted in 
previous studies (Liu et al. 2010; Sakarya et al. 2010), sup- 
ports the notion that substrate binding entails the confor- 
mational adaptability and physicochemical specificity of 
recognition sites (Luque and Freire 2000; Dobbins et al. 
2008) prior to stabilization by conserved interactions at 
the binding epitope. It is widely accepted that the stabili- 
zation of the bound ligand is achieved by residues con- 
served within families or subfamilies. However, prior to 
binding, the first step is recognition, and mobility/coevo- 
lution of the recognition sites appears to be a design prin- 
ciple to accommodate the geometry and chemistry of the 
substrate (Lovell and Robertson 2010; Mittag et al. 2010). 
Our analysis reveals which amino acids have high coevolu- 
tion propensities along with enhanced mobilities to 
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satisfactorily fulfill these requirements. Arg, Met, and polar 
residues are distinguished in this respect as versatile medi- 
ators of interactions with specific substrates. We also noted 
that there is another, somewhat less prominent, group of 
coevolving amino acids, which appears to be assisting con- 
served residues in either binding the substrate or coordi- 
nating cooperative responses, and this group has, in 
contrast to the former group, relatively suppressed mobi- 
lities in the global modes. 

The reaction at the active site of an enzyme usually re- 
quires high precision; catalytic residues need to be accu- 
rately positioned and oriented and highly conserved to 
achieve chemical specificity (Sacquin-Mora and Lavery 
2006; Dutta and Bahar 2010). Conserved residues that serve 
as folding nuclei also need to be highly stable (Mirny and 
Shakhnovich 1999). The observed conservation may thus 
be determined by functional (e.g., catalytic) or structural 
(e.g., stability) requirements, rather than structural dynam- 
ics. However, systematic examination of the structural and 
dynamic properties of conserved residues shows that it is 
possible to identify sites whose conservation appears to be 
exclusively associated with dynamic (e.g., global hinge) role, 
among other conserved sites. 

The evolutionary versus dynamic properties of binding 
sites may depend on the size and specificity of the sub- 
strate, whether it is a small molecule (e.g., ATP) or a biopoly- 
mer (e.g., protein). The two types of interactions have been 
shown to exhibit distinct structural properties: the former 
is conserved and almost rigid, whereas the latter tend to 
exhibit correlated mutations and higher mobility (Jones 
and Thornton 1997; Liu et al. 2010). Preorganization of con- 
served residues with restricted mobility has been suggested 
to help in stabilizing the bound conformer with minimal 
entropic penalty (Yogurtcu et al. 2008), whereas in the 
opposite case of high mobility, the favorable enthalpic in- 
teraction with the binding partner may more than com- 
pensate the unfavorable entropic contribution provided 
that the interaction surface is large enough (protein-pro- 
tein interactions). Insights into such design properties 
may be gained by performing similar investigations for dif- 
ferent classes of complexes. Interfacial residues of obligate 
pairs are more conserved than that of transient pairs, or 
alternatively, they contain correlated mutations (Mintseris 
and Weng 2005), although the distinctive dynamics of 
these two classes have yet to be established. Likewise, al- 
though the present analysis has been performed for en- 
zymes, it remains to be seen if/how the observations 
hold for other classes, including in particular membrane 
proteins whose growing number of structures is expected 
to soon lend themselves to systematic analyses. 

Finally, as suggested in a recent study (Poelwijk et al. 
2007), approaches that explicitly incorporate structure, 
function, and fitness are likely to bring a new perspective 
to molecular evolution research, beyond the insights 
gained from comparative analyses of sequence variations. 
The present study is another step toward that direction, 
which takes advantage of the advances made in recent 
years in structure-based assessment of collective dynamics 



and accessible paths of structural changes, using coarse- 
grained network models for proteins' architectures. 

Supplementary Material 

Supplementary tables S1-S5, figures S1-S8, and text S1 are 
available at Molecular Biology and Evolution online (http:// 
www.mbe.oxfordjournals.org/). 
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